-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Reduce the number of distinct binaries to reduce CI / compile time #18289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Reduce the number of distinct binaries to reduce CI / compile time #18289
Conversation
|
Would you be able to provide a high level overview of the changes made here? The PR seems pretty big and it's hard to review without some description of the changes made |
|
High-Level Overview This PR consolidates multiple standalone example binaries into a single example binary that uses subcommands to run individual examples. The main goal is to reduce the number of distinct binaries and make the examples easier to organize and maintain. Key Changes
Example Usage To run a specific example, use: cargo run --example datafusion -- datafusion |
|
I’ve noticed that the CI cargo audit check is failing due to the following advisories:
My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project. Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)? |
bc2f0a6 to
0a9cbe2
Compare
It was fixed: #18288 |
…e-number-of-distinct-binaries
|
The merge conflict in encrypted.rs has been resolved, and all checks have passed. Ready for review. |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up.
I think rust_example.sh needs to be fixed to run the examples in this new format.
At least from the compilation artifacts it seems a nice win:
# off main
datafusion (main)$ du -s -h ~/.cargo_target_cache/ci/examples/
5.0G /Users/jeffrey/.cargo_target_cache/ci/examples/
# off this PR
datafusion (pr_18289)$ du -s -h ~/.cargo_target_cache/ci/examples/
1.4G /Users/jeffrey/.cargo_target_cache/ci/examples/|
|
||
| //! # Advanced UDF/UDAF/UDWF/Asynchronous UDF Examples | ||
| //! | ||
| //! This example demonstrates advanced user-defined functions in DataFusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| //! This example demonstrates advanced user-defined functions in DataFusion. | |
| //! These examples demonstrates advanced user-defined functions in DataFusion. |
| #[tokio::main] | ||
| async fn main() -> Result<()> { | ||
| let arg = std::env::args().nth(1).ok_or_else(|| { | ||
| eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf|udwf|async_udf]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's better to construct the options from ExampleKind so if we add examples in future there's one less place to forget to update 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point
| - [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser` | ||
| - [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files) | ||
| - [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries. | ||
| - [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to uplift this doc to be a table instead, so it's easier at a glance to see which command to run for which example, e.g.
| Group | Example |
|---|---|
| advanced_udaf | udf |
| advanced_udaf | udwf |
| query_planning | analyzer_rule |
Or would this cause too much duplication of information and be prone to drift?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes a lot of sense to create a table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?
| AsyncUdf, | ||
| } | ||
|
|
||
| impl FromStr for ExampleKind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder if there's a way to templatize/abstract this whole file so they don't drift in structure from the others 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - indeed, the example entry files follow a similar pattern. It could make sense to refactor this into a shared helper or macro in the future to keep them in sync. For now, I kept this consistent with the existing example structure.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cj-zhukov -- this is an exciting idea but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contingiuous review time, which is quite hard to find)
Is there any way to break it up into smaller pieces?
For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?
Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples
| - [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser` | ||
| - [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files) | ||
| - [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries. | ||
| - [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?
Thank you for the feedback! That makes perfect sense - I agree it’ll be much easier to review and discuss the approach in smaller steps. |
|
@Jefffrey @alamb That’s a great point - I agree that consolidating further could simplify maintenance and make it easier for new users to explore examples. For now, I’ll start with a smaller PR focusing on the This way, we can gradually reduce the number of binaries while keeping reviews manageable. |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - part of ##18142. ## Rationale for this change As discussed in #18289 this PR is for consolidating all the `flight` examples into a single example binary. Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - part of #apache#18142. ## Rationale for this change As discussed in apache#18289 this PR is for consolidating all the `flight` examples into a single example binary. Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
I think the @cj-zhukov is breaking this one up into more smaller PRs so marking this one as draft |
Thanks, @alamb ! Yes, that’s correct - I’m currently splitting this work into smaller PRs. I’ll continue updating and consolidating all the remaining examples as part of this effort. |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - part of #apache#18142. ## Rationale for this change As discussed in apache#18289 this PR is for consolidating all the `flight` examples into a single example binary. Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…18142)
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?